This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
This data set contains 4898 observations, with 11 independent variables and 1 dependent variable in the original data. There are no missing values. The dependent variable, quality, is a score between 0 and 10; the higher the score, the better the white wine tastes.
According to the attribute description for the original data, when the free SO2 concentration is over 50 ppm, SO2 becomes evident in the nose and taste of the wine. For water, 1 ppm is approximately 1 mg/L, and the independent variable free sulfur dioxide is measured in mg/dm^3, which is the same as mg/L. So the values can be read as ppm directly, without a unit conversion. With this information, I create a new variable named free.sulfur.50 that stores “0” if free.sulfur.dioxide is no greater than 50 and “1” if free.sulfur.dioxide is greater than 50.
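The threshold variable described above could be built roughly as follows. This is a minimal sketch: the data frame `wine` and its five values are made up for illustration, and only the column names come from the text.

```r
# Hypothetical stand-in for the wine data; values are invented.
wine <- data.frame(free.sulfur.dioxide = c(2, 34, 50, 50.5, 289))

# 1 dm^3 equals 1 L, so mg/dm^3 is mg/L, which the text reads as ppm.
# Flag wines whose free SO2 exceeds the 50 ppm threshold.
wine$free.sulfur.50 <- factor(ifelse(wine$free.sulfur.dioxide > 50, 1, 0),
                              levels = c(0, 1))

# Count how many wines fall on each side of the threshold.
print(table(wine$free.sulfur.50))
```

Storing the flag as a factor rather than a number is what makes summary() report counts per level instead of quartiles.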
Before I start exploring the relationships among these variables, I’d like to look at each independent and dependent variable on its own to get to know them better. I do this through their summaries and distributions: I use the summary() function to get summary statistics, and I plot a histogram of each variable to get a basic idea of its distribution.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality free.sulfur.50
## Min. : 8.00 Min. :3.000 0:4030
## 1st Qu.: 9.50 1st Qu.:5.000 1: 868
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
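A summary like the one above, together with per-variable histograms, could be produced along these lines. This is only a sketch on simulated data: the data frame `wine`, its three columns and the random values are assumptions, not the real data set.

```r
# Simulated stand-in for the wine data; the values are random, not real.
set.seed(1)
wine <- data.frame(
  pH = rnorm(200, mean = 3.19, sd = 0.15),
  alcohol = runif(200, min = 8, max = 14.2),
  free.sulfur.50 = factor(sample(c(0, 1), 200, replace = TRUE))
)

# Summary statistics: quartiles for numeric columns, counts for factors.
print(summary(wine))

# Compute histogram bins for every numeric column.
# Dropping plot = FALSE would draw each histogram instead of only binning.
numeric_cols <- names(wine)[sapply(wine, is.numeric)]
hist_list <- lapply(numeric_cols, function(col) hist(wine[[col]], plot = FALSE))
names(hist_list) <- numeric_cols

# The bin counts show where each variable peaks.
print(hist_list$pH$counts)
```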
From the summary, curiously, every independent variable except pH and alcohol has a maximum value far away from its 3rd quartile. I am curious whether these maxima come from the same observation. There is also a maximum quality score of 9, which means the white wine tastes almost excellent. I also want to have a look at these records.
## [1] "max fixed.acidity appears on line: 1527"
## [1] "max volatile.acidity appears on line: 4040"
## [1] "max citric.acid appears on line: 746"
## [1] "max residual.sugar appears on line: 2782"
## [1] "max chlorides appears on line: 485"
## [1] "max free.sulfur.dioxide appears on line: 4746"
## [1] "max total.sufur.dioxide appears on line: 4746"
## [1] "max density appears on line: 2782"
## [1] "max sulphates appears on line: 4887"
Judging from the line numbers of the maximum values, free.sulfur.dioxide and total.sulfur.dioxide have their maxima in the same record (line 4746), and so do residual.sugar and density (line 2782). I suspect there might be a relationship between free.sulfur.dioxide and total.sulfur.dioxide, and I will examine these two variables later.
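The row lookup behind the output above can be sketched with which.max(). The small data frame here is invented for illustration; only the column names and the message format follow the text.

```r
# Invented four-row example; only the column names follow the report.
wine <- data.frame(
  fixed.acidity = c(6.3, 14.2, 7.0, 6.8),
  free.sulfur.dioxide = c(23, 34, 289, 46),
  total.sulfur.dioxide = c(108, 134, 440, 167)
)

# which.max() returns the index of the first maximum in each column.
max_rows <- sapply(wine, which.max)

for (col in names(max_rows)) {
  print(paste("max", col, "appears on line:", max_rows[[col]]))
}
```

Note that which.max() reports only the first row when the maximum is tied, so ties would need a separate check such as which(x == max(x)).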
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 775 775 9.1 0.27 0.45 10.6
## 821 821 6.6 0.36 0.29 1.6
## 828 828 7.4 0.24 0.36 2.0
## 877 877 6.9 0.36 0.34 4.2
## 1606 1606 7.1 0.26 0.49 2.2
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 775 0.035 28 124 0.99700 3.20
## 821 0.021 24 85 0.98965 3.41
## 828 0.031 27 139 0.99055 3.28
## 877 0.018 57 119 0.98980 3.28
## 1606 0.032 31 113 0.99030 3.37
## sulphates alcohol quality free.sulfur.50
## 775 0.46 10.4 9 0
## 821 0.61 12.4 9 0
## 828 0.48 12.5 9 0
## 877 0.36 12.7 9 1
## 1606 0.42 12.9 9 0
From this table, it seems that a white wine with a higher score has more alcohol. To get more detail, I use the summary() function again on these 5 records so that they can be compared with the summary table of the whole data set.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 775.0 Min. :6.60 Min. :0.240 Min. :0.290
## 1st Qu.: 821.0 1st Qu.:6.90 1st Qu.:0.260 1st Qu.:0.340
## Median : 828.0 Median :7.10 Median :0.270 Median :0.360
## Mean : 981.4 Mean :7.42 Mean :0.298 Mean :0.386
## 3rd Qu.: 877.0 3rd Qu.:7.40 3rd Qu.:0.360 3rd Qu.:0.450
## Max. :1606.0 Max. :9.10 Max. :0.360 Max. :0.490
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 1.60 Min. :0.0180 Min. :24.0 Min. : 85
## 1st Qu.: 2.00 1st Qu.:0.0210 1st Qu.:27.0 1st Qu.:113
## Median : 2.20 Median :0.0310 Median :28.0 Median :119
## Mean : 4.12 Mean :0.0274 Mean :33.4 Mean :116
## 3rd Qu.: 4.20 3rd Qu.:0.0320 3rd Qu.:31.0 3rd Qu.:124
## Max. :10.60 Max. :0.0350 Max. :57.0 Max. :139
## density pH sulphates alcohol
## Min. :0.9897 Min. :3.200 Min. :0.360 Min. :10.40
## 1st Qu.:0.9898 1st Qu.:3.280 1st Qu.:0.420 1st Qu.:12.40
## Median :0.9903 Median :3.280 Median :0.460 Median :12.50
## Mean :0.9915 Mean :3.308 Mean :0.466 Mean :12.18
## 3rd Qu.:0.9906 3rd Qu.:3.370 3rd Qu.:0.480 3rd Qu.:12.70
## Max. :0.9970 Max. :3.410 Max. :0.610 Max. :12.90
## quality free.sulfur.50
## Min. :9 0:4
## 1st Qu.:9 1:1
## Median :9
## Mean :9
## 3rd Qu.:9
## Max. :9
Comparing this summary table with that of the whole data set, besides the pattern in alcohol, I find that fixed.acidity for these 5 observations is around the 3rd quartile of the whole data, and so is pH. Chlorides for these 5 observations, however, are around or below the 1st quartile of the whole data, and so is density. I will pay attention to these variables when exploring the data later.
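Pulling out the top-scoring records and summarising them separately might look like the sketch below. The five rows are made up, and `top_wines` is a hypothetical name.

```r
# Made-up rows; only the column names follow the report.
wine <- data.frame(
  alcohol = c(10.4, 12.4, 9.5, 12.9, 10.1),
  quality = c(9, 9, 6, 9, 5)
)

# Keep only the records that reach the highest observed quality score.
top_wines <- subset(wine, quality == max(quality))
print(top_wines)

# Summarise the subset so it can be compared with the full-data summary.
print(summary(top_wines))
print(summary(wine$alcohol))
```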
In our sample, fixed.acidity is mainly distributed between 6 and 7, with a peak near its 1st quartile of 6.3.
In our sample, from this histogram, volatile.acidity is mainly distributed between 0.15 and 0.35, and its peak is around 0.21, which is also its 1st quartile.
In our sample, from this histogram, citric.acid is mainly distributed between 0.25 and 0.3, and its peak is near its 1st quartile of 0.27.
In our sample, from this histogram, residual.sugar is mainly less than 10 g/L. The peak appears around 2; its 1st quartile and median are 1.7 and 5.2.
In our sample, from this histogram, chlorides is tightly centered around its peak value of approximately 0.04. This is close to its 1st quartile of 0.036 and its median of 0.043.
In our sample, from this histogram, free.sulfur.dioxide is mainly less than 50 and concentrated between 20 and 40, which roughly matches the range between its 1st quartile of 23 and its 3rd quartile of 46.
In our sample, from this histogram, total.sulfur.dioxide is mainly distributed between 90 and 180, and its peak is around 130. This is close to its median and mean, which are 134 and 138.
In our sample, most of the white wines have a density less than 1.
In our sample, from this histogram, pH is mainly distributed between 3.0 and 3.3, and its peak is around 3.2. This is consistent with its summary statistics.
In our sample, from this histogram, sulphates seems to have two peaks, at 0.38 and 0.47.
In our sample, from this histogram, alcohol seems to fall into 3 chunks. The highest peak is between 9.4 and 9.5, in the first chunk.
We can see that most of our samples have a free.sulfur.dioxide concentration no greater than 50 ppm, and the difference in counts is huge: the left bar (4030 wines) is more than 4 times as tall as the right one (868 wines).
Most of the white wines in our sample have a middling score of 5 to 7, and 6 is the most common score.
The distribution of pH looks close to normal, but the others are more or less skewed. The dependent variable, which is based on sensory data, takes integer values between 0 and 10. Here, the most common white wine quality score is 6.
From the summary and histograms above, I am interested in free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol and pH. For the first two, I am interested not only in their individual relationships with quality, but also in how the combination of these two attributes affects the quality of our sample white wines.
Besides the features mentioned above, I think all the other features are also important in scoring the white wine, as their values come from physicochemical tests on our samples. As those measurements all come from the sample itself, there is no reason for me to ignore their effect on the sample's quality. So in the next part of my exploration, I would like to plot quality against every other variable, one by one, to see whether I can get some insights from those plots.
In our sample data, some values are relatively high compared with other values of the same attribute. From the summary part, we also know that most of them come from different observations. At this stage, I have no evidence that these values are definitely outliers, so I will keep them in the data for the following analysis without doing anything to them.
As I mentioned at the end of the previous section, in this part I will focus on exploring relationships between variables: both the relationships among the independent variables and the relationships between the independent variables and the dependent variable.
As a first step, I use ggpairs() to draw a scatterplot matrix that gives an overview of each pair of variables.
As our attribute names are too long to be fully displayed in a matrix, I made a copy of the data and renamed the columns for this plot only, without making changes to the original data file.
Here is how the column names are renamed:
1 - fixed.acidity: f.a
2 - volatile.acidity: v.a
3 - citric.acid: c.a
4 - residual.sugar: r.s
5 - chlorides: chl
6 - free.sulfur.dioxide: f.s.d
7 - total.sulfur.dioxide: t.s.d
8 - density: den
9 - pH: pH
10 - sulphates: sul
11 - alcohol: alc
12 - quality: qly
13 - free.sulfur.50: fs50
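Renaming the columns on a copy, so that the original data stays untouched, could be done as in this sketch. The one-row data frame and the names `wine_plot` and `short_names` are hypothetical; only four of the thirteen mappings are shown.

```r
# One invented row; only the column names and short names follow the text.
wine <- data.frame(fixed.acidity = 7.0, volatile.acidity = 0.27,
                   pH = 3.0, quality = 6)

# Lookup table from long names to the short names used in the plot matrix.
short_names <- c(fixed.acidity = "f.a", volatile.acidity = "v.a",
                 pH = "pH", quality = "qly")

# Work on a copy so the original column names are preserved.
wine_plot <- wine
names(wine_plot) <- unname(short_names[names(wine_plot)])

print(names(wine_plot))
print(names(wine))
```

Indexing the lookup table by the existing names keeps the mapping correct even if the column order changes.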
According to the matrix, there seems to be a correlation between:
1) fixed.acidity and pH: the plot looks like a band going bottom right
2) fixed.acidity and citric.acid: the plot looks like a horn going upper right
3) total.sulfur.dioxide and free.sulfur.dioxide: the plot looks like a band going upper right
4) density and residual.sugar: the plot looks like a band going upper right
5) density and total.sulfur.dioxide: the plot looks like a band going bottom right
6) density and alcohol: the plot looks like a band going bottom right
The corresponding correlation values are listed here for easier reading:
## [1] "Relevant Correlation Value of"
## [1] "1) fixed.acidity and pH: -0.425858"
## [1] "2) fixed.acidity and citric.acid: 0.289181"
## [1] "3) total.sulfur.dioxide and free.sulfur.dioxide: 0.615501"
## [1] "4) density and residual.sugar: 0.838966"
## [1] "5) density and total.sulfur.dioxide: 0.529881"
## [1] "6) density and alcohol: -0.780138"
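Pairwise correlations like those listed above can be computed with cor(). The sketch below uses simulated data whose coefficients were chosen only so that the signs resemble the reported ones; the numbers it prints are not the report's values.

```r
# Simulated data: density falls with alcohol and rises with sugar by construction.
set.seed(42)
n <- 500
alcohol <- runif(n, 8, 14)
residual.sugar <- runif(n, 1, 20)
density <- 1.0 - 0.001 * alcohol + 0.0004 * residual.sugar +
  rnorm(n, sd = 0.0005)
wine <- data.frame(alcohol, residual.sugar, density)

# Pairs of columns to report, mirroring the list format used in the text.
pairs_to_check <- list(c("density", "residual.sugar"),
                       c("density", "alcohol"))

# Pearson correlation for each pair.
cors <- sapply(pairs_to_check, function(p) cor(wine[[p[1]]], wine[[p[2]]]))

for (i in seq_along(pairs_to_check)) {
  p <- pairs_to_check[[i]]
  print(sprintf("%d) %s and %s: %f", i, p[1], p[2], cors[i]))
}
```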
The relationships involving density are easy to understand. As alcohol is less dense than water, more alcohol lowers the density of the white wine, which leads to a negative relationship between these two variables. For a similar reason, residual.sugar and white wine density have a positive relationship.
According to some documents about wine, the predominant fixed acids found in wines are tartaric, malic, citric, and succinic. This explains the positive relationship between fixed.acidity and citric.acid.
As to the relationship between fixed.acidity and pH and the relationship between density and total.sulfur.dioxide, I haven’t found any documents that clearly explain them. But one point I got from Wikipedia goes: “Generally, the lower the pH, the higher the acidity in the wine. However, there is no direct connection between total acidity and pH (it is possible to find wines with a high pH for wine and high acidity).”
What I did not expect is the relationship between total.sulfur.dioxide and free.sulfur.dioxide. According to my understanding, based on the attribute description file, their values will have an effect on white wine quality, but the presence of one of them should not influence the other. This relationship might be due to some chemistry that I am not familiar with, but I will explore their effect on white wine quality based on plots and analysis.
In this part, I will plot quality against each of the other attributes to get more detail about their relationships.
As mentioned above, “Generally, the lower the pH, the higher the acidity in the wine. However, there is no direct connection between total acidity and pH (it is possible to find wines with a high pH for wine and high acidity).” So I arrange the plots of fixed.acidity, volatile.acidity, citric.acid and pH in one figure to make them easier to compare.
We can read from the figure above that the boxes for each quality level go up and down, and no obvious pattern can be seen in these plots.
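Box plots of an attribute per quality level rely on treating quality as a factor. The sketch below computes the box statistics without drawing; the nine rows and the column name `quality.factor` are invented for illustration.

```r
# Invented rows: three wines at each of three quality levels.
wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  pH = c(3.0, 3.1, 3.2, 3.1, 3.2, 3.3, 3.2, 3.3, 3.4)
)

# Treat quality as a category so each score gets its own box.
wine$quality.factor <- factor(wine$quality)

# Box statistics per quality level; drop plot = FALSE to draw the boxes.
box_stats <- boxplot(pH ~ quality.factor, data = wine, plot = FALSE)

# The per-level medians are the center lines of the boxes.
medians <- tapply(wine$pH, wine$quality.factor, median)
print(medians)
```

If the medians move steadily in one direction across levels, the boxes show a trend; if they go up and down, as described above, there is no clear pattern.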
From the previous plot matrix, we see that there is a relationship between free.sulfur.dioxide and total.sulfur.dioxide. Meanwhile, as sulphates can contribute to sulfur dioxide gas (SO2) levels, I arrange the plot of quality and free.sulfur.dioxide, the plot of quality and total.sulfur.dioxide, and the plot of quality and sulphates together for comparison and cross-checking.
I observe no pattern between quality and sulphates, as the boxes sit at almost the same horizontal level. From the plot of quality and free.sulfur.dioxide, I notice that white wines with quality less than 5 have lower free.sulfur.dioxide concentrations, and when quality is at least 5 the boxes become tighter but are all centered at a similar level. From the plot of quality and total.sulfur.dioxide, when the quality is less than 5 or larger than 5, quality increases when total.sulfur.dioxide increases, but there is a jump between quality 4 and quality 5.
As I mentioned earlier, SO2 becomes evident when the free SO2 concentration is over 50 ppm. To better understand the relationship between quality and both free.sulfur.dioxide and total.sulfur.dioxide, I need to look at a plot of quality and free.sulfur.50.
Quality scores from 3 to 9 appear both for free.sulfur.dioxide no greater than 50 ppm and for free.sulfur.dioxide greater than 50 ppm. But what we can see here is that the black dots for free.sulfur.dioxide no greater than 50 ppm are all much larger than those for free.sulfur.dioxide greater than 50 ppm.
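The dot sizes in such a plot reflect how many wines share each combination of quality and free.sulfur.50. A cross-tabulation gives the same counts as numbers; the eight rows below are made up.

```r
# Made-up rows; only the column names follow the report.
wine <- data.frame(
  quality = c(3, 5, 5, 6, 6, 6, 9, 9),
  free.sulfur.50 = factor(c(0, 0, 1, 0, 0, 1, 0, 1))
)

# Count wines for every combination of quality score and SO2 level.
counts <- table(quality = wine$quality,
                free.sulfur.50 = wine$free.sulfur.50)
print(counts)
```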
As density is related to both residual.sugar and alcohol, I again arrange their plots together for a better view and exploration.
I notice a jump in the plot of quality and residual.sugar and in the plot of quality and density. Generally, quality increases when residual.sugar or density decreases. However, a jump happens when the quality goes from 4 to 5. And for quality and alcohol, when quality increases, the alcohol concentration goes down first and then goes up.
Because of the axis limits, some observations are removed from the plot. Looking at the plot, higher quality goes with a lower chlorides concentration, i.e. the relationship between quality and chlorides seems to be negative.
After exploring relationships between each pair of independent variables, I found relationships between fixed.acidity and pH, fixed.acidity and citric.acid, total.sulfur.dioxide and free.sulfur.dioxide, density and residual.sugar, density and total.sulfur.dioxide, and density and alcohol.
For the relationships between the dependent variable and the independent variables, among the features I am interested in (free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol, and pH), I could not find strong support for a relationship with quality for any of them except alcohol, as jumps were seen in those plots. As for alcohol, the quality goes down first and then goes up as alcohol goes from low to high concentration.
For the other supporting attributes, the relationship between quality and chlorides is negative, and there are no obvious relationships between quality and the others.
So far, the strongest relationships I have found are probably those between density and residual.sugar and between density and alcohol among the independent variables, and the one between quality and alcohol for the dependent variable.
In this part, I will continue to look at how free.sulfur.dioxide and total.sulfur.dioxide affect white wine quality. I also want to create a model and check, by looking at the coefficient of each variable, whether it agrees with the analysis so far.
In this part, I draw a scatterplot of quality and total.sulfur.dioxide colored by free.sulfur.dioxide level. Level 0 indicates that the free.sulfur.dioxide in the sample is no greater than 50 ppm, and Level 1 indicates that it is greater than 50 ppm.
Here I observe an interesting thing. If we look only at the red points, they seem to form a curve that reaches its highest point when total.sulfur.dioxide is about 100 mg/dm^3. The curve rises when total.sulfur.dioxide is smaller than 100 mg/dm^3 and falls when total.sulfur.dioxide is larger than 100 mg/dm^3. If we look only at the green points, they seem to form a straight line with a negative slope. So we might say that when free.sulfur.dioxide is no greater than 50 ppm and total.sulfur.dioxide is less than 100 mg/dm^3, an increase in total.sulfur.dioxide goes with better white wine quality.
In the first part, I saw that among the wines with a quality score of 9, 4 of 5 have high alcohol. During the later exploration, I found that alcohol is related to density, with a correlation of approximately -0.78. This is a strong relationship. So in this part, I would like to know how the combination of alcohol and density affects quality.
There are some patterns in this plot: for example, darker boxes always appear at lower positions, and lighter boxes always appear to the left of darker ones. But we can also see that density has some outliers. This plot will be refined later by removing those outliers for a better exploration.
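The colored boxes group wines by alcohol level. One way to build such a grouping is cut(), sketched below on made-up values. The upper three intervals match the alcohol.bucket labels that appear in the regression output; the lower bound of 8 for the first interval is an assumption taken from the minimum alcohol value in the summary.

```r
# Made-up alcohol values spanning the observed range.
alcohol <- c(8.0, 9.5, 10.4, 10.5, 11.4, 12.9, 14.2)

# Break points: 8 is assumed; 9.5, 10.4, 11.4 and 14.5 come from the labels.
# include.lowest = TRUE keeps the minimum value inside the first bucket.
alcohol.bucket <- cut(alcohol,
                      breaks = c(8, 9.5, 10.4, 11.4, 14.5),
                      include.lowest = TRUE)

print(levels(alcohol.bucket))
print(table(alcohol.bucket))
```

Because intervals are closed on the right, a wine at exactly 9.5 falls in the first bucket and one at exactly 10.4 in the second.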
In this part, I would like to fit the data using a linear regression.
##
## Call:
## lm(formula = quality ~ ., data = wineQW_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5999 -0.4994 -0.0213 0.4643 3.1614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.424e+02 1.873e+01 7.602 3.48e-14 ***
## fixed.acidity 6.487e-02 2.086e-02 3.109 0.00189 **
## volatile.acidity -1.880e+00 1.148e-01 -16.375 < 2e-16 ***
## citric.acid 2.886e-02 9.607e-02 0.300 0.76393
## residual.sugar 7.594e-02 7.540e-03 10.072 < 2e-16 ***
## chlorides -4.935e-01 5.463e-01 -0.903 0.36639
## free.sulfur.dioxide 7.857e-03 1.063e-03 7.388 1.74e-13 ***
## total.sulfur.dioxide -4.148e-04 3.768e-04 -1.101 0.27111
## density -1.424e+02 1.900e+01 -7.491 8.08e-14 ***
## pH 6.824e-01 1.057e-01 6.456 1.18e-10 ***
## sulphates 6.502e-01 1.000e-01 6.500 8.82e-11 ***
## alcohol 1.886e-01 3.704e-02 5.093 3.66e-07 ***
## free.sulfur.501 -2.592e-01 4.059e-02 -6.386 1.86e-10 ***
## alcohol.bucket(9.5,10.4] -1.206e-01 4.032e-02 -2.990 0.00280 **
## alcohol.bucket(10.4,11.4] -9.641e-02 6.225e-02 -1.549 0.12151
## alcohol.bucket(11.4,14.5] -8.778e-03 1.001e-01 -0.088 0.93011
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.747 on 4882 degrees of freedom
## Multiple R-squared: 0.2908, Adjusted R-squared: 0.2886
## F-statistic: 133.5 on 15 and 4882 DF, p-value: < 2.2e-16
From the linear regression summary, we see that the adjusted R-squared is not high, at only 0.2886. But the summary table can still give us some idea of how the attributes relate to white wine quality. Here fixed.acidity, volatile.acidity, residual.sugar, free.sulfur.dioxide, density, pH, sulphates, alcohol and free.sulfur.50 all have statistically significant coefficients (p < 0.01). Some of these are consistent with the previous exploration but some are not. Because the R-squared is low, this only gives a rough sense of the relationships rather than firm support for them.
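A fit of the same form, `lm(quality ~ ., data = ...)`, can be sketched on simulated data as follows. The data frame `wine_model`, its two predictors and the generating coefficients are assumptions for illustration; the printed numbers will not match the table above.

```r
# Simulated data: quality rises with alcohol and falls with volatile acidity.
set.seed(7)
n <- 300
wine_model <- data.frame(
  alcohol = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0)
)
wine_model$quality <- 3 + 0.3 * wine_model$alcohol -
  1.5 * wine_model$volatile.acidity + rnorm(n, sd = 0.7)

# "quality ~ ." regresses quality on every other column in the data frame.
fit <- lm(quality ~ ., data = wine_model)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value.
print(round(coef(fit_summary), 4))

# Adjusted R-squared penalizes the plain R-squared for the number of predictors.
print(fit_summary$adj.r.squared)
```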
I kept focusing on the features I am interested in throughout this part. I am happy to find that there is some relationship between quality and total.sulfur.dioxide, depending on the free.sulfur.dioxide level, in our white wine sample. The surprising thing is that those points form a curve, which is not what I expected.
In this part, I also fitted a linear regression to see whether it would support my findings. It partially does, for example for alcohol, density and free.sulfur.50. But some coefficients are not consistent with my earlier exploration. This might be because my previous exploration was based on rough plots and I missed some subtle but vital information. It is also possible that the linear regression fitted here is not a good fit for this data. As we already observed a curve in one of the relationships, the whole data set might be fitted much better by a non-linear regression.
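One simple way to probe such curvature, offered here only as a sketch and not as the analysis performed in this report, is to add a squared term to the linear model and compare the adjusted R-squared. The data below are simulated with a peak near 100 by construction, and the names `toy`, `linear_fit` and `quadratic_fit` are hypothetical.

```r
# Simulated inverted-U relationship peaking near total_so2 = 100.
set.seed(3)
total_so2 <- runif(200, 20, 200)
quality <- 7 - 0.0004 * (total_so2 - 100)^2 + rnorm(200, sd = 0.3)
toy <- data.frame(total_so2, quality)

# Straight-line fit versus a fit with an added squared term.
linear_fit <- lm(quality ~ total_so2, data = toy)
quadratic_fit <- lm(quality ~ total_so2 + I(total_so2^2), data = toy)

# Compare how much variance each model explains.
adj_r2 <- c(linear = summary(linear_fit)$adj.r.squared,
            quadratic = summary(quadratic_fit)$adj.r.squared)
print(round(adj_r2, 3))
```

A clearly higher adjusted R-squared for the quadratic model, together with a negative coefficient on the squared term, would be consistent with a rise-then-fall curve.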
In this part, I would like to choose, refine and share the 3 plots I found most interesting in my exploration of the data.
The first plot I would like to share is the scatterplot of quality and total.sulfur.dioxide at different free.sulfur.dioxide levels. I choose this plot because of the curve, which I did not expect at the beginning of my exploration.
It is observed that:
1) when free.sulfur.dioxide is no greater than 50 ppm and total.sulfur.dioxide is smaller than 100 mg/dm^3, white wine quality increases as total.sulfur.dioxide increases.
2) when free.sulfur.dioxide is no greater than 50 ppm and total.sulfur.dioxide is greater than 100 mg/dm^3, white wine quality decreases as total.sulfur.dioxide increases.
3) when free.sulfur.dioxide is greater than 50 ppm, white wine quality decreases as total.sulfur.dioxide increases.
The second plot I would like to share is the plot of quality, alcohol and density. I choose this plot because of the pattern I found in it. As I mentioned earlier, I first remove observations with a density larger than 1 from the data and then replot.
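Dropping the observations with a density larger than 1 before replotting could look like this sketch. The five rows and the name `wine_trimmed` are made up for illustration.

```r
# Made-up rows; two of them have a density above 1.
wine <- data.frame(
  density = c(0.9871, 0.9937, 1.0390, 1.0000, 1.0103),
  alcohol = c(12.9, 10.4, 11.7, 9.5, 8.8)
)

# Keep only wines whose density does not exceed 1.
wine_trimmed <- subset(wine, density <= 1)

print(nrow(wine) - nrow(wine_trimmed))  # number of rows removed
print(wine_trimmed)
```

Keeping the trimmed data in a separate object leaves the full data set available for the other plots.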
It is observed that:
1) the darker the box color is, the lower the vertical position of the box is. This matches what we know from the previous exploration: when alcohol concentration increases, density decreases.
2) the alcohol.bucket from left to right always goes from light blue to dark blue. That is, when density is the same or similar, a higher alcohol concentration goes with a better quality.
As the last plot, I would like to share the scatterplot of quality and residual.sugar. The reason for choosing this plot is that, interestingly, I found a vertical curve in it.
It is observed that the lines made up of dots seem to be symmetric, with a center at quality 6. If you imagine a line connecting the dark orange dot at the right end of each line, you will find that it is a curve. This is interesting.
After all the exploration, we can say that white wine quality is related to alcohol, density, residual.sugar, free.sulfur.dioxide, and total.sulfur.dioxide. We also get the sense that the relationship between quality and the other attributes might not be linear.
What made the exploration hard for me is that it is really difficult to observe any patterns in bivariate scatterplots. Things got better when I drew box plots after converting quality to a factor and when I did a multivariate analysis. Surprisingly, non-linear relationships were found among some variables.
I think more multivariate analyses like those I did here, such as the relationship between quality and total.sulfur.dioxide by free.sulfur.dioxide level and the relationship among quality, alcohol and density, could be done for a better analysis.